-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add CreateIndex commit type to python API #2883
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. I appreciate the well-thought-out test.
However, we should make sure to handle fragment_bitmap
. It plays an important role in the query path and maintenance of the index.
python/src/dataset.rs
Outdated
name, | ||
fields, | ||
dataset_version, | ||
fragment_bitmap: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fragment_bitmap
is essential for the maintenance of indices. We shouldn't leave it blank. You can set it to be a set()
in Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I added fragment_ids to the python api.
I guess in normal case, we should just pass the fragment_ids of all fragments to it for index built over a static dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in normal case, we should just pass the fragment_ids of all fragments to it for index built over a static dataset?
Yes that's accurate. There are other cases, such as when another process has appended during the index operation, or the index operation was only performed on a subset of data.
# Check if the index is used in scans | ||
for dataset in [dataset_with_index, dataset_without_index]: | ||
scanner = dataset.scanner( | ||
fast_search=True, prefilter=True, filter="meta = 'hello'" | ||
) | ||
plan = scanner.explain_plan() | ||
assert "MaterializeIndex" in plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of shocked it is if you don't pass down the fragment bitmap. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works because there is a fallback that scans the full index to recalculate this. It's much preferable to be able to pass down fragment_bitmap
, as this scan could be slow for large indices.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2883 +/- ##
==========================================
- Coverage 78.12% 77.89% -0.24%
==========================================
Files 229 231 +2
Lines 70492 70523 +31
Branches 70492 70523 +31
==========================================
- Hits 55073 54932 -141
- Misses 12305 12459 +154
- Partials 3114 3132 +18
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
18fb870
to
98efdf5
Compare
e82b43a
to
454d0d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the fragment ids. Could we make that a set? Once that's done I think this is good to go.
python/python/lance/dataset.py
Outdated
name: str | ||
fields: List[int] | ||
dataset_version: int | ||
fragment_ids: List[int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer set
here.
fragment_ids: List[int] | |
fragment_ids: Set[int] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/src/dataset.rs
Outdated
name: String, | ||
fields: Vec<i32>, | ||
dataset_version: u64, | ||
fragment_ids: Vec<u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fragment_ids: Vec<u32>, | |
fragment_ids: &PySet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
No description provided.